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Abstract 

Background: Comparison of the human genome with other primates offers the opportunity to detect 
evolutionary events that created the diverse phenotypes among the primate species. Because the primate 
genomes are highly similar to one another, methods developed for analysis of more divergent species do not 
always detect signs of evolutionary selection. 

Results: We have developed a new method, called DivE, specifically designed to find regions that have evolved 
either more or less rapidly than expected, for any clade within a set of very closely related species. Unlike some 
previous methods, DivE does not rely on rates of synonymous and nonsynonymous substitution, which enables it 
to detect evolutionary events in noncoding regions. We demonstrate using simulated data that DivE compares 
favorably to alternative methods, and we then apply DivE to the ENCODE regions in 14 primate species. We 
identify thousands of regions in these primates, ranging from 50 to >10000 bp in length, that appear to have 
experienced either constrained or accelerated rates of evolution. In particular, we detected 4942 regions that have 
potentially undergone positive selection in one or more primate species. Most of these regions occur outside of 
protein-coding genes, although we identified 20 proteins that have experienced positive selection. 

Conclusions: DivE provides an easy-to-use method to predict both positive and negative selection in noncoding 
DNA, that is particularly well-suited to detecting lineage-specific selection in large genomes. 



Background 

The genome of a living species is the product of a long 
series of changes, including neutral, beneficial, and detri- 
mental alterations to the sequence. Sequence changes 
that affect the organism's fitness are subject to evolu- 
tionary pressures, such as the pressure to survive, to 
out-compete other species, and to defend the organism 
against external attack. In order to uncover these 
changes, we need to know what the ancestral genome 
looked like, which we can infer by comparing multiple 
genomes to one another. As we accumulate genomes 
from species related to human, and especially from 
within the primate lineages, we should be able to learn 
more about what makes humans special. At the same 
time, we can learn what makes each primate different 
from the others. Until recently, methods for detecting 
the effects of evolution had been designed for relatively 
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distant species such as humans and mice. With the pub- 
lication of the chimpanzee genome [1], we had our first 
look at a very close relative of human. The genomes of 
chimpanzees and humans are so close, in fact, that 
sequence similarity cannot be used to infer functional 
significance: in most cases, similarity simply reflects the 
recent divergence between the species. With more spe- 
cies, sequence comparison even among close relatives 
can be used to tease apart regions that are constrained 
by evolutionary forces and that, consequently, are likely 
to have functional importance to the biology of humans. 

Recently, the ENCODE project selected 13 primates 
(in addition to human) and sequenced 1% of each gen- 
ome to produce "comparative grade" [2] assemblies. 
These high-quality sequences from close human rela- 
tives give us a greater ability than before to detect the 
signs of evolutionary selection on the human genome 
and other primates. The traces of evolution's effects can 
be found more easily when they are shared among mul- 
tiple species. Signs of selection also may indicate func- 
tionally important sequences, and in particular they can 
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be used to identify regulatory regions that fall outside 
protein-coding regions and are otherwise difficult to 
find. 

Broadly speaking, there are two main types of selective 
processes driving the evolution of genomes. Negative or 
purifying selection is the evolutionary pressure that 
eliminates deleterious mutations from a population. 
Most mutations in the genome are probably neutral, 
because most of the genome is itself non-functional, but 
within coding regions, the majority of mutations are 
deleterious [3]. Deleterious mutations are likely to be 
transient; i.e., they do not become fixed in the human 
population. Negative selection has been identified prin- 
cipally by pairwise sequence alignment methods, 
through which DNA or amino acid sequences can be 
shown to be more highly conserved than expected based 
on the overall evolutionary distance between a pair of 
species. By one well-known estimate, approximately 5% 
of the human genome is under negative selection [4], of 
which only 1.5% is contained in protein-coding exons. 

Positive selection is more difficult to detect. In positive 
selection, a region of the genome, protein coding or 
otherwise, accumulates beneficial mutations that provide 
a survival advantage to the organism. One way to detect 
positive selection is by the presence of genes that have 
acquired many more mutations than other genes when 
compared to close relatives. A well-documented exam- 
ple of positive selection is the rapid change in the 
hemagglutinin protein on the surface of the human 
influenza virus, which is in constant competition with 
the human immune system [5]. Positive selection must 
be carefully distinguished from the relaxation of selec- 
tive constraints, however. If a sequence (a gene or a reg- 
ulatory sequence) ceases to perform its function, and if 
that function is no longer needed by the organism, then 
it might accumulate mutations faster precisely because 
it is no longer functional. 

In this study, we describe a new method, called DivE, 
for detecting lineage-specific regions evolving at a 
slower or faster rate than the background evolutionary 
rate in the primate genomes. Other methods have been 
previously developed for detecting selection, but most 
look only at conservation of sequence (negative selec- 
tion) in all aligned species, and are not lineage specific 
[6-12]. Methods to detect accelerated regions (i.e. 
regions evolving at faster-than-neutral rates) have also 
appeared recently [13-18]. Some of these methods allow 
for lineage-specific selection [14-16,18], but in contrast 
with conservation-detection methods, they cannot be 
easily used for genome-wide scans to detect selection, 
and look only at particular regions of interest. Although 
accelerated regions may indicate positive selection, this 
is not necessarily the case [19]. There are many exam- 
ples where positive selection manifests itself at only a 



small number of sites [20-23]. Our method is not suited 
to the identification of positive selection in these cases. 

Recently a new program, phyloP, was developed to 
examine the more general problem of detecting either 
conserved or accelerated regions in a set of aligned 
orthologous sequences from multiple species [24]. Phy- 
loP implements four different statistical phylogenetic 
tests to find significant departures from non-neutral 
substitution rates on a whole phylogeny as well as on 
selected subtrees (clades) of interest in the phylogeny. It 
was shown to have fairly good accuracy in detecting 
strong selection even at individual nucleotides. In one 
respect, DivE is similar to phyloP in that both methods 
try to solve the general problem of detecting an increase 
or a decrease in the rate of substitution in a given geno- 
mic region, either on a whole phylogeny or within a 
clade of the phylogeny. 

However, in phyloP the phylogenetic subtree of 
interest needs to be provided to the program, while in 
contrast DivE addresses the more complicated problem 
in which the lineage of interest is not pre-specified. 
Therefore the lineage under selection must be detected 
automatically by DivE from among all possible subtrees 
within a phylogeny. Another significant difference is 
that applying phyloP to an entire genome to detect 
selection involves using a sliding window approach. 
Although a sliding-window analysis is a popular 
method to test for negative or positive selection, there 
are results that show that this approach is not gener- 
ally valid if selective trends are not known a priori in a 
given region [25]. In addition, the sensitivity of phyloP 
is dependent on the size of the window used to scan 
the genome, which in turn depends on the number of 
species available. DivE doesn't use a sliding window 
approach, but instead tries to determine the optimal 
size for the selected genomic element that is predicted 
to be under selection. In regard to these differences, 
DivE is more similar to DLESS [26], a method that 
detects sequences that have either come under selec- 
tion, or begun to drift, in any lineage. While DLESS 
only allows for detection of a "gain" event (conserva- 
tion in a phylogenetic subtree) or a "loss" event (where 
a subtree is evolving neutrally while the rest of the 
tree is conserved), DivE also detects acceleration events 
in any clade of the tree. DLESS is the only other com- 
putational method, prior to DivE, that can detect line- 
age-specific selection when the lineage of interest is 
not pre-specified. 

Below we present our method for detecting both con- 
served and accelerated regions and apply it to 14 pri- 
mate genomes. We describe results on simulated and 
real data, including the identification of positively 
selected genes that intersect regions evolving faster than 
the neutral mutation rate. The method described in this 



Pertea et al. BMC Bioinformatics 201 1, 12:274 
http://www.biomedcentral.eom/1 471 -21 05/1 2/274 



Page 3 of 1 3 



paper is implemented in the DivE package which is 
available as free, open-source software [27]. 

Results 

Simulation results 

For our simulation tests, we created sequence elements 
that were both positively and negatively selected within 
the same 14 primate species used for our later experi- 
ments on real data. Because we knew the precise loca- 
tion, size, and type of selection involved in each 
element, we could use this data to evaluated the accu- 
racy of DivE and compare it to other methods. 

We created simulated data sets that contain selected 
elements of lengths between 50 bp and 1000 bp in all 
subtrees of the phylogeny of the 14 primates (see Fig- 
ure 1 and Methods for a description of the primate 
phylogeny). Conserved elements are either "gained" or 
"lost" on a particular lineage, where a "gain" event 
implies that the region defined by that particular line- 
age will experience selective pressure that will tend to 
eliminate individuals with mutations in that region (i. 
e., negative or purifying selection). A "loss" event 



implies that the region in question does not have evo- 
lutionary constraints, and will evolve at the neutral 
substitution rate, while the rest of the tree is con- 
strained. The average substitution rate observed for 
conserved elements is a fraction of the non-conserved 
regions, and we therefore can simulate negative selec- 
tion by reducing the branch lengths of the selected 
subtree (for gain) or supertree (for loss), as depicted in 
Figure 2. For accelerated elements, the observed substi- 
tution rate is greater than the neutral rate. A special 
case of accelerated evolution is positive selection, 
which occurs when a sequence is under pressure to 
change more rapidly; e.g., in order to adapt to changes 
in the environment. A particular subtree might be 
under positive selection if the branch from the whole 
tree leading to that subtree is elongated, while the 
branches within the subtree are the same or shorter 
than the background mutation rate (see Figure 2). 

In our simulation, the accelerated elements in a given 
subtree are generated according to a phylogeny in which 
the parent edge of the node at the root of the subtree is 
longer. We chose a scaling parameter p (where p e {0.01, 
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Figure 1 Phylognetic tree of the fourteen primates used in this study. The branches in the tree are proportional to the expected number 
of substitutions per site. For reference, a scale of 0.01 substitutions per site is shown. The names of the major clades in this phylogeny are 
shown in italics. All nodes have 100% support values in a Bayesian phylogeny framework, except for node a. 
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supertree ("gain") supertree ("loss") acceleration 

Figure 2 The selection problem. Branch length changes to the neutral phylogenetic tree are shown for each type of selection considered in 
this study: gain, loss, and acceleration. The figure shows a phylogenetic tree composed of a supertree T with root A, containing a subtree t with 
root A' that is affected by selection. The left side of the figure shows the effects of negative selection on the relative shape and position oft. In 
the case of "gain" elements the branches within t as well as the branch from A to A' are reduced in length. In the case of "loss" elements, t is 
unaffected while the rest of the tree is reduced in size (conserved). Both gain and loss events are particular cases of negative selection affecting 
different lineages in the tree. In the case of acceleration, shown on the right side of the figure, only the branch from A to A' increases in length. 
Positive selection is one cause, but not the only cause, of this type of increase. Branches within t may decrease in length if there is additional 
selection within the tree. 



0.02, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5}) to represent the selection 
strength affecting the elements by reducing (through 
multiplication by p) or enlarging (by dividing by p) differ- 
ent branches of the phylogenetic tree depending on the 
type of section [28] as described above. We obtained the 
results presented here by generating 100 elements for 
each subtree in the phylogeny, type of selection, selection 
strength, and length. 

We used DivE to detect regions that have either come 
under negative selection, or begun to drift on any line- 
age. We compared its results to DLESS [26], the only 
previous method to our knowledge that has been devel- 
oped to solve the same problem as the one presented 
here. The same input was given to both DivE and 
DLESS: a set of aligned orthologous sequences and a 



neutral phylogenetic model generated with phyloFit [29], 
which included a substitution rate matrix, a tree with 
branch lengths in units of expected substitutions per 
site, and estimates of nucleotide equilibrium frequencies. 

Table 1 shows the average accuracies obtained by 
DivE and DLESS for the prediction of elements of dif- 
ferent lengths and at different selection strengths (as 
specified by the scaling parameter p) that are either 
gained or lost. Averages are computed on all clades in 
the phylogeny (excluding the whole tree). The accuracy 
of prediction for each program is computed as the aver- 
age of the precision and recall to detect sites under 
selection. Precision and recall performances for detect- 
ing negative selection in all subtrees for both DivE and 
DLESS are shown in Additional file 1, Tables S1-S16. 



Pertea et al. BMC Bioinformatics 201 1, 12:274 
http://www.biomedcentral.eom/1 471 -21 05/1 2/274 



Page 5 of 1 3 



Table 1 Average accuracy (shown as a percentage) of DivE (dE) and DLESS (dl) to detect conservation and accelerated 



evolution computed on all primate subtrees for simulated data of different lengths. 
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Accuracy was computed as the average of precision and recall. The first column in table shows the type of selection (gain, loss, or acceleration), and the value of 
the scaling parameter p. Smaller values of p correspond to stronger selection pressure, which is easier to detect. Results in bold show the larger accuracy value 
for each comparison between DivE and DLESS. 



In almost all cases, DivE's accuracy is significantly 
greater than that obtained by DLESS. In the few cases 
where DLESS obtains a better accuracy, this accuracy 
is less than 0.5% greater than that of DivE. As 
observed before [26] the power to predicting elements 
that are lost is greater than the power to predicted ele- 
ments that are gained. Actually, in the case of lineage- 
specific "gains" (the upper third of Table 1), the aver- 
age accuracy only surpasses 50% for elements longer 
than 500 bp. 

Figure 3 shows that accuracy can be greater when 
considering specific clades with more species, and 
longer branch lengths. In fact the power of detection 
gets significantly higher if the clade has at least two spe- 
cies compared to the power of detection in a single spe- 
cies (see Additional file 1, Tables S1-S8). In the case of 
"losses," the power is generally quite good especially for 
elements of 100 bp and greater. 



Table 1 also shows the accuracy obtained by DivE on 
predicting elements that are accelerated, a function lack- 
ing in DLESS. There is no program that we are aware of 
that is designed to detect both the accelerated element 
as well as the specific phylogenetic lineage in which the 
acceleration takes place. It is interesting to note that the 
power of detecting accelerated elements is significantly 
greater than the power to predict "gained" elements, 
although not as good as the power to detect "lost" ele- 
ments (see also Additional file 1, Tables S17-S24). 

We should also note that by design DivE cannot dis- 
tinguish between accelerations in the simian clade and 
accelerations in the prosimian clades. This is due to the 
symmetry of the phylogenetic tree in which enlarge- 
ments in the branches from the phytogeny' s root to the 
simians and to the prosimians are indistinguishable. If 
acceleration in any of these clades was detected, we 
assumed a 50% chance that the acceleration was in 
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Figure 3 Accuracy of gain and loss predictions on the major primate clades for negatively selected regions. All regions are 200 bp in 
length. DivE is shown with a continuous line, and DLESS with a dotted line. On the x axis is the value of the scaling parameter p determining 
the selection strength, while the accuracy (the average of precision and recall) is shown on the y axis. 



either clade. In real applications, an outgroup can be 
used to distinguish between these two cases. 

Both DivE and DLESS can predict conservation not only 
on subtrees of the phylogenetic tree, but on all branches 
of the phylogeny as well For this task, DivE's accuracy is 
again significantly better than that of DLESS, as shown in 



Table 2. In this case, the power of detecting elements 
under negative selection is quite good even for elements 
as small as 50 bp, when p < 0.2. The accuracy of detection 
increases significantly for elements bigger than 100 bp, 
and is over 90% for p = 0.3, a value that has been shown 
to be specific to conserved regions in vertebrates [10]. 
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Table 2 Accuracy of DivE (dE) and DLESS (dl) on 
detecting negative selection in all primates 
using simulated data. 



selection 50 bp 100 bp 200 bp 500 bp 1000 bp 
strength 

dE dl dE dl dE dl dE dl dE dl 

0.01 93.6 92.3 97.3 88.2 98.2 94.2 99.3 94.3 99.6 96.7 

0.02 93.4 86.2 96.7 92.3 98.3 91.4 99.3 93.2 99.7 94.1 

0.05 94.2 86.0 95.6 87.8 98.2 95.2 99.4 95.7 99.6 98.6 

0.1 94.9 85.5 95.9 90.9 98.1 92.9 99.2 95.1 99.6 96.1 

0.2 79.1 64.3 95.3 94.5 97.8 92.8 99.0 94.7 99.5 95.0 

0.3 64.4 47.4 93.3 87.2 95.7 95.2 98.5 91.7 99.2 96.3 

0.4 49.7 23.4 75.2 65.0 93.6 89.8 97.7 91.6 98.8 91.2 

0.5 37.5 8.4 62.7 46.2 85.5 70.0 94.9 83.6 97.2 87.1 



Results in bold show the greater accuracy. 

Results on fourteen primate species 

Next we applied DivE to the multiple alignments of the 
44 ENCODE regions [30] of primates (see Methods). 
These regions cover -1% of the human genome 
sequence and represent the largest mammalian com- 
parative data set yet published [31]. For our analysis, we 
focused exclusively on the primate sequences because 
DivE is designed to detect selection in closely related 
groups of species. 

When run on all primate ENCODE regions, DivE pre- 
dicted 21,633 elements with non-neutral substitution 
rates. These elements had lengths ranging from 50 bp 
to 40,000 bp, with an average length of 700 bp. Of all 
predicted elements, 12,385 (covering -28% of the 32.1 
million columns of the ENCODE regions' alignment) 
showed negative selection on either all or a subset of 
the branches of the phylogenetic tree. This coverage is 
even higher in protein coding regions and UTRs, where 
it reaches 74%, and 49% respectively. About 65% of the 
predicted sites undergoing negative selection were either 
fully conserved or lineage-specific gains, while the rest 
of them were lineage-specific losses. 

DivE predicted 9248 lineage-specific acceleration 
events, covering 20% of all columns in the ENCODE 
alignment of the primates. Given its reported power to 
detect acceleration for a given clade [24], we ran phyloP 
to validate our predictions. PhyloP reported p-values of 
0.05 or less for 84% of our predicted accelerated ele- 
ments. By comparison, phyloP reported a p-value below 
0.05 for 98% of the elements that are predicted to be 
under negative selection, either within a subtree or on 
all branches of the phylogeny. This suggests that DivE 
may have somewhat less power to detect acceleration as 
opposed to conservation. Additional file 1, Table S25 
shows the exact percentages of the ENCODE primate 
sequences predicted to be either conserved or acceler- 
ated by DivE. 



We were particularly interested in regions for which 
we can find evidence of positive selection. As a stricter 
criterion for positive selection, we looked at the 9248 
accelerated regions and asked whether, in each case, the 
subtree involved was internally conserved; i.e., after the 
acceleration event, the sequences were conserved more 
than expected (see Figure 2). This filter eliminates cases 
of apparent acceleration that might be artifacts of low 
quality sequence or mis-alignment. DivE predicted 4942 
regions (about 53% of all predictions) to be lineage-spe- 
cifically accelerated (p-value < 0.05 as determined by 
phyloP) where the regions also appear conserved when 
looking only at the subtree for that particular lineage. 
(These regions include both one-node subtrees and 
clades of 2 or more species). A list of all elements that 
are potentially undergoing positive selection, as well as 
all elements predicted by DivE can be downloaded from 
ftp://cbcb.umd.edu/pub/data/DivE. 

Positive selection (PS) is one of the most important 
evolutionary forces that shape our development, and 
many efforts have been made to detect its presence in 
protein-coding genes. The most widely used methods is 
to determine if the ratio of nonsynonymous (dN) to 
synonymous (dS) substitutions is larger than 1 [32]. 
This only works in protein-coding regions, while DivE 
considers all regions, but we wanted to compare its pre- 
dictions in protein-coding regions with those made by a 
standard method. 

We used the dN/dS ratio estimated with the PAML 
package [33] to test for positive selection (PS) in primate 
genes from the ENCODE regions that overlap DivE's 
predicted accelerated elements. There are 443 human 
genes with annotations in the ENCODE regions. Out of 
these, 182 genes overlap with accelerated elements pre- 
dicted by DivE. Because the other primates were not 
annotated, we used Jigsaw [34] for each human gene 
(see Methods) to determine its equivalent annotation in 
the other primates. Similarly to previous work [35], to 
minimize the false positive rate when predicting positive 
selection, we accepted as valid only transcripts that 
aligned to at least 80% of the human CDS, without 
frame shifts. At each gene locus we only retained the 
transcripts that had the longest overlap with the acceler- 
ated predicted element. By selecting for each gene the 
transcript that was conserved in most species, and using 
the longest coding length in deciding ties, we obtained 
80 genes that mapped to at least two other species 
besides human and overlapped the predicted accelerated 
elements. This procedure resulted in an average of 10 
orthologous transcripts for each gene. 

Using PAML to test for PS on the branches predicted 
as accelerated by DivE, we determined 25 genes with p- 
value < 0.05. After multiple testing correction, 20 genes 
were identified as positively selected genes (PSGs). 
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Table 3 List of positively selected genes that intersect DivE's predicted accelerated regions. 



Gene 
symbol 


Gene full name 


Encode 
region 


Subtree 

under 

selection 


No. of 

orthologous 

sequences 


P-value 


CPDDIMR7 

jtnrlNb/ 


serpin peptidase inhibitor, clade B (ovalbumin), member 7 


ti\ir 1 ZZ 


new world 
monkeys 


I 5 


<.UUU I 


SLC22A4 


solute carrier family 22 (organic cation/ergothioneine transporter), 
member 4 


ENm002 


orangutan 


14 


<.0001 


UGT1A8 


UDP glucuronosyltransferase 1 family, polypeptide A8 


ENrl 31 


baboon 


7 


<.0001 


LILRA4 


leukocyte immunoglobulin-like receptor, subfamily A (with TM 
domain), member 4 


ENm007 


human-orangutan 


6 


<.0001 


AP0A4 


apolipoprotein A-IV 


ENm003 


macaque 


13 


<.0001 


nbUI 


hemoglobin, theta 1 


LNlTlUUo 


chimp 


1 4 


<.0001 


r / 


coagulation factor VII (serum prothrombin conversion accelerator) 


ti\ir 1 DZ 


macaque 


1 1 
l l 


<.UUU I 


Lz lorn j 
(LCA5L) 


Leber congenital amaurosis 5-like 


ti\ir i jj 


mouse lemur 


1 3 


n nnn/i 
U.UUU4 


LCj/V 


cingulin 


LNrzo I 


dusky titi 


1 3 

I D 


n nnn/i 
U.UUU4 


LILRB4 


leukocyte immunoglobulin-like receptor, subfamily B 
(with TM and ITIM domains), member 4 


ENm007 


vervet-baboon 


6 


0.0007 


ALUUOyoD.D 
(HEATR7B1) 


HEAT repeat containing 7B1 


ENrl 31 


prosimians 


9 


0.0007 


AnrlLjUILj 


Rho GDP dissociation inhibitor (GDI) gamma 


tNmUUo 


simians 


I U 


n nm ~i 
U.UU I / 


F8 


coagulation factor VIII, procoagulant component 


ENm006 


owl monkey 


10 


0.0026 


SYT8 


synaptotagmin VIII 


ENm011 


new world 
monkeys 


8 


0.0030 


C22orf30 
(PRR14L) 


proline rich 14-like 


ENm004 


simians 


12 


0.0039 


HBZ 


hemoglobin, zeta 


ENm008 


macaque 


4 


0.0043 


LILRB5 


leukocyte immunoglobulin-like receptor, subfamily B 
(with TM and ITIM domains), member 5 


ENm007 


macaque 


3 


0.0068 


DEPDC5 


DEP domain containing 5 


ENm004 


human-colobus 
monkey 


12 


0.007 


RPS9 


ribosomal protein S9 


ENm007 


human-colobus 
monkey 


13 


0.0227 


DDX43 


DEAD (Asp-Glu-Ala-Asp) box polypeptide 43 


ENr223 


new world 
monkeys 


12 


0.0246 



Positive selection was determined with the codeml program from the PAML package. The first column gives the ENCODE gene symbol followed in parentheses 
by the corresponding Genbank gene symbol (if different). The fourth column in the table shows the name of the subtree predicted to be accelerated by DivE. If 
there are multiple species in the selected tree, the clade is specified by the two species whose most common ancestor form the clade's root, or by the clade's 
name as specified in Figure 1. 



These genes, shown in Table 3, represent only 25% of 
our initial 80 genes; i.e., only one in four genes that 
were found to intersect accelerated elements were pre- 
dicted as PSGs. The remaining genes might be false 
positives found by DivE, or alternatively, they might 
have been missed because fewer sites among their 
CDS's were found to be accelerated than for the 25% 
that were consistent with the PAML analysis of PS (246 
bp vs 383 bp in average), and PAML itself may have less 
power to determine PS on whole coding regions where 
only small fractions of them show acceleration. Also, as 
one recent study found [36], positive selection may be 
limited to specific regions of genes that are otherwise 
conserved. As mentioned above, the subtrees tested for 



the presence of PS were the ones predicted by DivE to 
be under acceleration. In cases where the acceleration 
was predicted in either the simians or the prosimians 
clade, we tested both clades for the presence of PS, and 
we reported the one with the highest p-value. It should 
be noted though that in most cases PAML finds very 
close p-values for both clades. 

Finally we looked at gene ontology (GO) [37] cate- 
gories associated with the identified PSGs. The two 
most common GO categories were GO:0016021 ('inte- 
gral to membrane') and GO:0005886 ('plasma mem- 
brane'). The first category has been previously found to 
be over-represented among mammalian genes predicted 
to be under positive selection [35]. In fact 8 out of our 
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20 PSGs are associated with over-represented GO cate- 
gories among mammalian PSGs (see Additional file 1, 
Table S26). 

Discussion 

In this study we introduced DivE, a new method to 
detect lineage-specific selection in a group of closely 
related species. In contrast to most previous methods, 
our approach does not restrict its search to selection 
events in a particular lineage, but rather tries to discover 
the particular clade undergoing selection as well as the 
type of selection. We should note that DivE does not 
specifically detect positive selection, but rather identifies 
regions undergoing accelerated change. An accelerated 
rate of evolution can be used as an indication that posi- 
tive selection has occurred, but it could also be due to 
other factors, including relaxation of selective 
constraints. 

Results on simulated data suggest that this method 
performs as well or better than earlier methods, espe- 
cially for longer sequences, and its performance in dis- 
covering negative selection is comparable to similar 
methods both on real and simulated data. DivE also has 
an advantage in that it does not make any assumptions 
about the expected length of the regions under selec- 
tion, or about the strength of the evolutionary con- 
straint. On the other hand, our method admittedly 
makes the simplifying assumption that the neutral rate 
of point mutation is uniform over the genome (although 
local neutral substitution rates could be estimated a 
priori and provided as input to DivE). Neither does it 
model insertion and deletion events, but rather it 
assumes that data is missing when it encounters a dele- 
tion. Although this assumption is not always realistic, it 
is probably true for most deletions appearing in the 
ENCODE regions of the non-human primates. Never- 
theless these assumptions are likely to influence the 
results. Also, compared to other methods, DivE is not a 
rigorously statistical method, but rather a heuristic 
approach designed to capture regions with scores higher 
than the false positive scores observed in neutrally simu- 
lated sequences whose length and composition is similar 
to the real data. Such regions are then predicted as 
selected. One strength of the method is its fast running 
time for discovering lineage-specific selected regions, 
which then can be further evaluated with statistical 
methods (e.g., phyloP). 

We focused on primate genomes in our study because 
of their obvious relevance to human, and also because 
sequence data is now available for 14 different primates 
for a select set of regions covering approximately 1% of 
each genome. The very close evolutionary relationship 
of these species required the development of a new 
method that could detect signs of possible selection in 



very recently diverged species. We were particularly 
interested in positive selection, because rapidly evol- 
ving regions might be key to explaining the unique 
phenotypic traits that distinguish humans from other 
primates and the primates from one another. The DivE 
program predicted accelerated regions overlapping 182 
genes in the ENCODE regions. Not all evolutionary 
acceleration events are caused by adaptive evolution 
[38], and therefore DivE alone should not be used to 
infer positive selection. Using the ratio of nonsynon- 
ymous to synonymous substitution, we found evidence 
of positive selection in different lineages of the pri- 
mates in 20 genes, representing about 5% of all known 
ENCODE genes that have valid transcripts in at least 
two other primates. These 20 genes represent only 
25% of all the genes predicted to intersect accelerated 
regions. These results agree with an analysis of positive 
selection and acceleration in the evolution of the 
human and chimp genomes [19], where the authors 
found that many of the genes with accelerated rates 
did not show significant signals of positive selection. 
Anyway, as a recent study points out [39], these results 
should be treated with caution because estimates of 
positive selection can be greatly influenced by errors in 
sequencing, annotation and alignment. Although the 
ENCODE regions for these 14 primates are between 
17 and 30 Million bp (see Additional file 1, Table S25) 
they are still not 100% complete [40], and they repre- 
sent only a small fraction of the whole genomes. They 
also contain less than 2% of the approximately 22,000 
known human genes [41]. Therefore the results pre- 
sented here might not extrapolate well to an alignment 
of complete genomes. 

Conclusions 

The evolutionary events relating a group of species can 
often be discerned by searching for patterns in the 
aligned DNA sequences from those species. These pat- 
terns can reveal the critical changes that make each spe- 
cies unique and different from its relatives. Regions of 
strong sequence conservation are a sign of negative (or 
purifying) selection, in which detrimental changes are 
eliminated. Positive selection, in which beneficial 
changes are retained in a species, can appear as a region 
in a multi-species alignment with many more changes 
than expected. Different selective pressures can act on 
different lineages of a phylogeny. Although computa- 
tional methods have been developed to identify regions 
under selection across species or in a particular lineage, 
most of these are limited to protein coding sequences 
only, or to identifying negative selection, or both. Little 
attention has been given to identifying selection pressure 
on both coding and non-coding sequences for any 
branch of a phylogeny. 
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Here we developed a heuristic novel approach that 
can discover either positive or negative selection on any 
lineage in a phylogeny, without making prior assump- 
tions about the strength of the selective pressure. We 
have implemented our method in an open-source soft- 
ware package, called DivE, and applied it to identify 
lineage-specific selection in a group of fourteen 
primates. 

Methods 

Dataset 

We downloaded the 44 ENCODE [30] sequences and their 
multiple sequence alignments for 14 primates (human, 
chimp, orangutan, gibbon, colobus monkey, vervet, 
baboon, macaque, dusky titi, owl monkey, marmoset, 
squirrel monkey, mouse lemur, galago), along with their 
associated neutral evolution phylogenetic model (as 
defined in section 3.2) from the NHGRI site (ftp://kronos. 
nhgri.nih.gov/pub/outgoing/elliott/encode/freeze/JAN- 
2009/). The tree topology, shown in Figure 1, follows the 
NCBI taxonomy (http://www.ncbi.nlm.nih.gov/guide/tax- 
onomy/), while the support values for the nodes are from 
[42,43]. The alignments were generated with the TBA pro- 
gram [44] using the method described elsewhere [31]. The 
neutral model was estimated from all fourfold degenerate 
sites in the ENCODE regions, using the phyloFit program 
[29] from the PHAST package [45], and assuming the gen- 
eral time-reversible (REV) substitution model [46]. 

Finding regions under selection 

We constructed the program called DivE ( Div ergence by 
Evolution) to detect lineage-specific selection by identi- 
fying conserved or accelerated elements specific only to 
a subgroup of species. Given a multiple alignment of 
orthologous sequences, and two phylogenetic models, 
one describing regions evolving under neutral evolution 
and one describing regions under selection in a particu- 
lar lineage, DivE employs a greedy heuristic to deter- 
mine local regions in the alignment where the following 
log ratio score: 



log 



P(lineage specific region under selection) 
P(neutrally evolving region) 



(1) 



is positive and maximal. 

We evaluate the score in equation (1) for every line- 
age in the phylogeny (which corresponds to a single 
node in the phylogenetic model), between any two col- 
umns in the alignment, and for any type of selection. 
If the score for a given lineage in the alignment region 
between the two columns is positive and above some 
fixed threshold, then we consider this an indication 
that the lineage has potentially undergone selection in 
that particular region. The region is discarded in the 



case where it contains another region whose score 
represents a significant deviation from selection 
(according to a previously identified threshold). This 
prevents the algorithm from clustering together nearby 
selected regions. If two candidate regions overlap, we 
predict as under selection only the maximum scoring 
region (see also Figure 4). Below we present our 
method for computing the score in equation (1). 

As in [47], we specify the neutral evolution phylogeny 
of the species through a phylogenetic tree model with 
four parameters: y/ n = (Q, z, {}, n), where Q is a substitu- 
tion rate matrix, z is the topology of the phylogenetic 
tree, /3 is a vector of branch lengths, and tt is a vector of 
equilibrium base frequencies, z consists of a set of 
nodes, V(r), and a set of edges, E(z), connecting the 
nodes. Given a positive number p, and the subtrees z\ z" 
of r, we will denote by p/3 the lengths of branches in 
multiplied by p, by fi\E(z) the lengths of branches in z' 
only, and by j8| E (z) | U /? | E (z") the lengths of 
branches in z and z". If u Lv (z), we denote by z u a sub- 
tree of z rooted at u, and by (u p u) the parent edge of u, 
where u p is the parent node of u. With these notations, 
we can now specify the phylogenetic models governing 
the gain (g), loss (/) and acceleration (a) evolutionary 
events relative to the neutral model \j/ as 

^(p) = (Q,r,p^|E(r\r w )U^|£(r w ),7r), 

f l u (p) = (Q,r, Pi g|E(r\r M )U^|E(r w ),7r), and 

K(P) = (Q f r f p- l p\{up f u)up\E{z\{up f u)) f jT) respec- 
tively, where p e (0,1) is the scaling parameter that 
identifies the selection strength of the model. 

We can now give a more detailed form for the score 
computed by DivE in equation (1). For each two col- 
umns i and j in an alignment of sequences, with i <;, 
selection type s e {g,l,a}, and node u e V (r), DivE com- 
putes the following score: 



S(i,j,s, u) = max 0, £ (logp(fe; f s u (p)) - logp(fe; ^) (2) 




ACGT p?CCGATI] AGAGAGT' 
ACGlTA^TCATTAGAGAGT 1 
CCGTOTCT 1 

Figure 4 Schematic figure of DivE's heuristic approach For 

each type of selection, and each node in the phylogenetic tree, first 
we identify local regions (denoted by shadowed rectangles in the 
figure) that score positively according to a log ratio selection score - 
the more intense the shadow color, the higher the score of that 
region is. Then DivE predicts as "under selection" the maximal 
scoring non-overlapping regions. The arrows in the figure connect 
predicted regions in the alignment. 



Pertea et al. BMC Bioinformatics 201 1, 12:274 
http://www.biomedcentral.eom/1 471 -21 05/1 2/274 



Page 11 of 13 



where p(k;y/) is the probability of the column k in the 
alignment under the phylogenetic model y/. We can 
compute the probability p(k;yj) using the Felsenstein 
algorithm [48]. As shown in equation (2) the scaling 
parameter p is not fixed in the input of the program in 
order to avoid assumptions about the selection strength 
in the observed local alignment. In the actual implemen- 
tation of DivE, for practical purposes, p is not left to 
vary freely in the (0,1) range, but there is a fixed num- 
ber of possible values for p that DivE can choose from. 
This number and the possible values for p can be chan- 
ged in the program's input. The results presented here 
were generated with p allowed to vary in the 
(0.05,0.1,0.2,0.3,0.4,0.5) set. 

If n represents the number of sequences in the align- 
ment, and L is the length of the alignment, than com- 
puting the score in equation (2) between any two 
columns in the alignment would lead to an 0(nL 2 ) run- 
ning time complexity for DivE's heuristics. The actual 
implementation of DivE has a running time of O(nLa) 
where a represents the average length of a selected ele- 
ment, since we don't need to evaluate a score between 
any i and ; (i<j) columns, if the region between the two 
columns contains a subregion that is neutral or shows 
selection of a different type (according to a predefined 
threshold). This linear running time of our heuristics 
makes DivE particularly well-suited to detecting lineage- 
specific selection in large genomes. 

Simulations 

We used the program evolver from the PAML package 
[33], version 3.15, to simulate aligned nucleotide 
sequences evolving either neutrally or under selection. 
Aligned neutral sequences were generated to follow 
the neutral model of evolution described in the Dataset 
section, and they were of the same length (about 32.1 
million bp) and with the same deletion patterns as the 
aligned ENCODE regions of the 14 primates. We used 
them to determine thresholds for DivE to call pre- 
dicted elements with a false positive rate of less than 
0.1%. To simulate sites under selection in a given sub- 
tree of the phylogeny the scaling parameter p<l was 
used to adjust the branch lengths of the phylogeny 
either through multiplication of all branches' lengths 
in the selected subtree (in the case of gain and loss), 
or through division of the length of the parent edge 
from the root of the subtree (in the case of accelera- 
tion). The sites under section were chosen to span 
regions of 50, 100, 200, 500, or 1000 aligned columns 
that were randomly inserted inside neutrally evolving 
region that was twice as long, flanked on either side by 
another 1000 neutral alignment columns. For any 
given subtree, length, and scaling parameter we 



generated 100 such regions containing sites under 
selection, resulting in a total number of 243,000 simu- 
lated regions, that we used to compute the accuracy of 
prediction. The accuracy was computed as the average 
of the recall (the number of sites under selection cor- 
rectly identified, also known as sensitivity) and preci- 
sion (the number of predicted sites that were truly 
under selection. 

Gene annotations for the ENCODE regions 

We downloaded the GENCODE reference set [49] of 
human gene annotations for the ENCODE regions from 
the UCSC genome browser [50], and we aligned the 
known human mRNA sequences from each ENCODE 
region to the corresponding genomic region of every 
primate species. The nucleotide sequences of these 
mRNAs were mapped using GMAP [51] and sim4cc 
[52]. Where available, the coding regions of these 
human transcripts were translated into proteins and 
aligned to the corresponding ENCODE regions in the 
other primates using PMAP (a protein mapping variant 
of GMAP) and exonerate [53]. The resulting spliced 
alignments were integrated using JIGSAW [34], which 
reported the combined transcript structures that had 
valid coding regions. 

Identification of genes under positive selection 

Positively selected genes were identified using the 
codeml program from PAML. As input, we used the 
tree in Figure 1, and the TBA alignments of the genes 
extracted from the ENCODE regions. The same proce- 
dure using the improved branch-site test 2 under 
"model A" described in [54] was used to determine sig- 
nificance p-values for positively selected genes in a spe- 
cific lineage. The codon frequencies were estimated 
from the average nucleotide frequencies at the three- 
codon position (F3 x 4 model). Test 2 has been shown 
to be the more robust in detecting positive selection 
among the other branch-site models implemented in 
codeml The null hypothesis of this test assumes that 
some of the branches are negatively selected and some 
are neutral, while for the alternative hypothesis some 
sites are allowed to undergo positive selection (dN/dS 
>1) along some predefined foreground branch, which in 
our case was selected to be the parent branch of the 
node at the root of the subtree predicted to be under 
acceleration by DivE in the coding portion of the gene. 
The LRT's values resulted from test 2 were compared 
against the distribution X 1 to determine significance p- 
values. Correction for multiple testing was performed 
using the Benjamini and Hochberg procedure [55] using 
a false discovery rate (FDR) of 10% as the tests for PS 
are already very conservative [54]. 
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Additional material 



Additional file 1: Supplementary Tables S1-S26. This document 
includes the accuracy obtained by DivE and DLESS for the prediction of 
all simulated elements, the percentage of the ENCODE primate 
sequences predicted to be either conserved or accelerated by DivE, and 
the GO categories associated with genes predicted to be under positive 
selection. 



Acknowledgements 

We thank Elliot Margulies for providing the ENCODE regions of the 14 
primates together with their alignments and estimated neutral model. This 
work was supported by NIH grant R01-LM006845. 

Author details 

1 Center for Bioinformatics and Computational Biology, University of 
Maryland, College Park, Maryland, USA. 2 McKusick-Nathans Institute of 
Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, 
Maryland, USA. 

Authors' contributions 

MP and SLS designed the DivE method. MP and GP implemented the 
algorithms. GP generated the gene annotations in the ENCODE regions for 
all non-human primates, and MP analyzed the data. MP and SLS wrote the 
paper. All authors read and approved the final manuscript. 

Received: 25 May 201 1 Accepted: 4 July 201 1 Published: 4 July 201 1 
References 

1. The Chimpanzee Sequencing and Analysis Consortium: Initial sequence of 
the chimpanzee genome and comparison with the human genome. 

Nature 2005, 437(7055):69-87. 

2. Blakesley RW, Hansen NF, Mullikin JC, Thomas PJ, McDowell JC, Maskeri B, 
Young AC, Benjamin B, Brooks SY, Coleman Bl, Gupta J, Ho SL, Karlins EM, 
Maduro QL, Stantripop S, Tsurgeon C, Vogt JL, Walker MA, Masiello CA, 
Guan X, Bouffard GG, Green ED: An intermediate grade of finished 
genomic sequence suitable for comparative analyses. Genome Res 2004, 
14(11):2235-2244. 

3. Eyre-Walker A, Woolfit M, Phelps T: The distribution of fitness effects of 
new deleterious amino acid mutations in humans. Genetics 2006, 

173(2):891-900. 

4. Waterston RH, Lindblad-Toh K, Birney E, Rogers J, Abril JF, Agarwal P, 
Agarwala R, Ainscough R, Alexandersson M, An P, Antonarakis SE, 
Attwood J, Baertsch R, Bailey J, Barlow K, Beck S, Berry E, Birren B, Bloom T, 
Bork P, Botcherby M, Bray N, Brent MR, Brown DG, Brown SD, Bult C, 
Burton J, Butler J, Campbell RD, Carninci P, et al: Initial sequencing and 
comparative analysis of the mouse genome. Nature 2002, 

420(691 5):520-562. 

5. Bush RM, Fitch WM, Bender CA, Cox NJ: Positive selection on the H3 
hemagglutinin gene of human influenza virus A. Mol Biol Evol 1999, 

16(11):1457-1465. 

6. Frazer KA, Sheehan JB, Stokowski RP, Chen X, Hosseini R, Cheng JF, 
Fodor SP, Cox DR, Patil N: Evolutionarily conserved sequences on human 
chromosome 21. Genome Res 2001, 1 1 (1 0):1 651-1 659. 

7. Margulies EH, Blanchette M, Haussler D, Green ED: Identification and 
characterization of multi-species conserved sequences. Genome Res 2003, 
13(12):2507-2518. 

8. Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, 
Rubin EM: Phylogenetic shadowing of primate sequences to find 
functional regions of the human genome. Science 2003, 

299(561 1):1 391 -1394. 

9. Cooper GM, Stone EA, Asimenos G, Green ED, Batzoglou S, Sidow A: 
Distribution and intensity of constraint in mammalian genomic 
sequence. Genome Res 2005, 15(7):901-913. 

10. Siepel A, Bejerano G, Pedersen JS, Hinrichs AS, Hou M, Rosenbloom K, 
Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, 
Gibbs RA, Kent WJ, Miller W, Haussler D: Evolutionarily conserved 



elements in vertebrate, insect, worm, and yeast genomes. Genome Res 
2005, 15(8):1 034-1 050. 

11. Asthana S, Roytberg M, Stamatoyannopoulos J, Sunyaev S: Analysis of 
sequence conservation at nucleotide resolution. PLoS Comput Biol 2007, 
3(12):e254. 

12. Garber M, Guttman M, Clamp M, Zody MC, Friedman N, Xie X: Identifying 
novel constrained elements by exploiting biased substitution patterns. 
Bioinformatics 2009, 25(1 2):i54-62. 

13. Wong WS, Nielsen R: Detecting selection in noncoding regions of 
nucleotide sequences. Genetics 2004, 167(2):949-958. 

14. Pollard KS, Salama SR, King B, Kern AD, Dreszer T, Katzman S, Siepel A, 
Pedersen JS, Bejerano G, Baertsch R, Rosenbloom KR, Kent J, Haussler D: 
Forces shaping the fastest evolving regions in the human genome. PLoS 
Genet 2006, 2(10):e168. 

15. Prabhakar S, Noonan JP, Paabo S, Rubin EM: Accelerated evolution of 
conserved noncoding sequences in humans. Science 2006, 314(5800):786. 

16. Bird CP, Stranger BE, Liu M, Thomas DJ, Ingle CE, Beazley C, Miller W, 
Hurles ME, Dermitzakis ET: Fast-evolving noncoding sequences in the 
human genome. Genome Biol 2007, 8(6):R1 18. 

17. Haygood R, Fedrigo O, Hanson B, Yokoyama KD, Wray GA: Promoter 
regions of many neural- and nutrition-related genes have experienced 
positive selection during human evolution. Nat Genet 2007, 

39(9):1 140-1 144. 

18. Kim SY, Pritchard JK: Adaptive evolution of conserved noncoding 
elements in mammals. PLoS Genet 2007, 3(9):1 572-1 586. 

19. Arbiza L, Dopazo J, Dopazo H: Positive selection, relaxation, and 
acceleration in the evolution of the human and chimp genome. PLoS 
Comput Biol 2006, 2(4):e38. 

20. Enard D, Depaulis F, Roest Crollius H: Human and non-human primate 
genomes share hotspots of positive selection. PLoS Genet 2010, 6(2): 
e 1000840. 

21. Messier W, Stewart CB: Episodic adaptive evolution of primate lysozymes. 
Nature 1997, 385(661 2):1 51 -154. 

22. Morgan CC, Loughran NB, Walsh TA, Harrison AJ, O'Connell MJ: Positive 
selection neighboring functionally essential sites and disease-implicated 
regions of mammalian reproductive proteins. BMC Evol Biol 2010, 10:39. 

23. Yang Z: Likelihood ratio tests for detecting positive selection and 
application to primate lysozyme evolution. Mol Biol Evol 1998, 
15(5):568-573. 

24. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A: Detection of nonneutral 
substitution rates on mammalian phylogenies. Genome Res 2010, 

20(1 ):1 10-121. 

25. Schmid K, Yang Z: The trouble with sliding windows and the selective 
pressure in BRCA1. PLoS One 2008, 3(11):e3746. 

26. Siepel A, Pollard KS, Haussler D: New methods for detecting lineage- 
specific selection. Proceedings of the 10th International Conference on 
Research in Computational Molecular Biology (RECOMB 2006) 2006, 190-205. 

27. DivE software. [http://cbcb.umd.edu/software/DivE/]. 

28. Cargill M, Altshuler D, Ireland J, Sklar P, Ardlie K, Patil N, Shaw N, Lane CR, 
Lim EP, Kalyanaraman N, Nemesh J, Ziaugra L, Friedland L, Rolfe A, 
Warrington J, Lipshutz R, Daley GQ, Lander ES: Characterization of single- 
nucleotide polymorphisms in coding regions of human genes. Nat Genet 
1999, 22(3):23 1-238. 

29. Siepel A, Haussler D: Phylogenetic estimation of context-dependent 
substitution rates by maximum likelihood. Mol Biol Evol 2004, 
21(3):468-488. 

30. The ENCODE Project Consortium: The ENCODE (ENCyclopedia Of DNA 
Elements) Project. Science 2004, 306(5696):636-640. 

31. Margulies EH, Cooper GM, Asimenos G, Thomas DJ, Dewey CN, Siepel A, 
Birney E, Keefe D, Schwartz AS, Hou M, Taylor J, Nikolaev S, Montoya- 
Burgos Jl, Loytynoja A, Whelan S, Pardi F, Massingham T, Brown JB, Bickel P, 
Holmes I, Mullikin JC, Ureta-Vidal A, Paten B, Stone EA, Rosenbloom KR, 
Kent WJ, Bouffard GG, Guan X, Hansen NF, Idol JR, et al: Analyses of deep 
mammalian sequence alignments and constraint predictions for 1% of 
the human genome. Genome Res 2007, 17(6):760-774. 

32. Yang Z, Bielawski JP: Statistical methods for detecting molecular 
adaptation. Trends Ecol Evol 2000, 1 5(1 2):496-503. 

33. Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol 
Evol 2007, 24(8):1 586-1 591. 

34. Allen JE, Salzberg SL: JIGSAW: integration of multiple sources of evidence 
for gene prediction. Bioinformatics 2005, 21 (18):3596-3603. 



Pertea et al. BMC Bioinformatics 201 1, 1 2:274 Page 1 3 of 1 3 

http://www.biomedcentral.eom/1 471 -21 05/1 2/274 



35. 



36. 



37. 



38. 



39. 



40. 



41. 



42. 



43. 



44. 



45 



46. 



47. 



49. 



50. 
51. 



52. 



53. 



54. 



55. 



Kosiol C, Vinar T, da Fonseca RR, Hubisz MJ, Bustamante CD, Nielsen R, 
Siepel A: Patterns of positive selection in six Mammalian genomes. PLoS 
Genet 2008, 4(8):e1000144. 

Demogines A, East AM, Lee JH, Grossman SR, Sabeti PC, Paull TT, Sawyer SL: 
Ancient and recent adaptive evolution of primate non-homologous end 
joining genes. PLoS Genet 2010, 6(10):e1001 169. 
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, 
Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, 
Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene 
ontology: tool for the unification of biology. The Gene Ontology 
Consortium. Not Genet 2000, 25(1):25-29. 
Ratnakumar A, Mousset S, Glemin S, Berglund J, Galtier N, Duret L, 
Webster M: Detecting positive selection within genomes: the problem of 
biased gene conversion. Philos Trans R Soc Lond B Biol Sci 2010, 
365(1 552):2571 -2580. 

Schneider A, Souvorov A, Sabath N, Landan G, Gonnet GH, Graur D: 

Estimates of positive Darwinian selection are inflated by errors in 

sequencing, annotation, and alignment. Genome Biol Evol 2009, 1:1 14-118. 

ENCODE Project: Comparative Sequencing, [http://www.nisc.nih.gov/ 

open_page.cgi?path=/projects/encode/index.cgi]. 

Pertea M, Salzberg SL: Between a chicken and a grape: estimating the 

number of human genes. Genome Biol 2010, 11(5):206. 

Perelman P, Johnson WE, Roos C, Seuanez HN, Horvath JE, Moreira MA, 

Kessing B, Pontius J, Roelke M, Rumpler Y, Schneider MP, Silva A, O'Brien SJ, 

Pecon-Slattery J: A molecular phylogeny of living primates. PLoS Genet 

2011, 7(3):e1001342. 

Schrago CG: On the time scale of New World primate diversification. Am 

J Phys Anthropol 2007, 132(3):344-354. 

Blanchette M, Kent WJ, Riemer C, Elnitski L, Smit AF, Roskin KM, Baertsch R, 
Rosenbloom K, Clawson H, Green ED, Haussler D, Miller W: Aligning 
multiple genomic sequences with the threaded blockset aligner. Genome 
Res 2004, 14(4)708-715. 

PHAST: Phylogenetic analysis with space/time models. [http://compgen. 
bscb.cornell.edu/phast/]. 

Tavare S: Some probabilistic and statistical problems in the analysis of 

DNA sequences. Lectures on Mathematics in the Life Sciences 1986, 17:57-86. 

Siepel A, Haussler D: Combining phylogenetic and hidden Markov 

models in biosequence analysis. J Comput Biol 2004, 1 1:(2-3):4 13-428. 

Felsenstein J: Evolutionary trees from DNA sequences: a maximum 

likelihood approach. J Mol Evol 1981, 17(6):368-376. 

Harrow J, Denoeud F, Frankish A, Reymond A, Chen CK, Chrast J, Lagarde J, 

Gilbert JG, Storey R, Swarbreck D, Rossier C, Ucla C, Hubbard T, 

Antonarakis SE, Guigo R: GENCODE: producing a reference annotation for 

ENCODE. Genome Biol 2006, 7(Suppl 1):S4 1-9. 

ENCODE Project at USCS. [http://genome.cse.ucsc.edu/ENCODE/]. 

Wu TD, Watanabe CK: GMAP: a genomic mapping and alignment 

program for mRNA and EST sequences. Bioinformatics 2005, 

21 (9):1 859-1 875. 

Zhou L, Pertea M, Delcher AL, Florea L: Sim4cc: a cross-species spliced 

alignment program. Nucleic Acids Res 2009, 37(11):e80. 

Slater GS, Birney E: Automated generation of heuristics for biological 

sequence comparison. BMC Bioinformatics 2005, 6:31. 

Zhang J, Nielsen R, Yang Z: Evaluation of an improved branch-site 

likelihood method for detecting positive selection at the molecular 

level. Mol Biol Evol 2005, 22(12):2472-2479. 

Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical 
and powerful approach to multiple testing. Journal of the Royal Statistical 
Society B 1995, , 57: 289-300. 



doi:1 0.1 1 86/1 471 -21 05-1 2-274 

Cite this article as: Pertea et al:. Detection of lineage-specific 
evolutionary changes among primate species. BMC Bioinformatics 201 1 

1 2:274. 



Submit your next manuscript to BioMed Central 
and take full advantage of: 

• Convenient online submission 

• Thorough peer review 

• No space constraints or color figure charges 

• Immediate publication on acceptance 

• Inclusion in PubMed, CAS, Scopus and Google Scholar 

• Research which is freely available for redistribution 



Submit your manuscript at 
www.biomedcentral.com/submit 



o 



BioMed Central 



